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Abstract 

Identification of important nodes in complex networks has attracted an increasing attention over the last decade. Various 
measures have been proposed to characterize the importance of nodes in complex networks, such as the degree, 
betweenness and PageRank. Different measures consider different aspects of complex networks. Although there are 
numerous results reported on undirected complex networks, few results have been reported on directed biological 
networks. Based on network motifs and principal component analysis (PCA), this paper aims at introducing a new measure 
to characterize node importance in directed biological networks. Investigations on five real-world biological networks 
indicate that the proposed method can robustly identify actually important nodes in different networks, such as finding 
command interneurons, global regulators and non-hub but evolutionary conserved actually important nodes in biological 
networks. Receiver Operating Characteristic (ROC) curves for the five networks indicate remarkable prediction accuracy of 
the proposed measure. The proposed index provides an alternative complex network metric. Potential implications of the 
related investigations include identifying network control and regulation targets, biological networks modeling and 
analysis, as well as networked medicine. 



Citation: Wang P, Lu J, Yu X (2014) Identification of Important Nodes in Directed Biological Networks: A Network Motif Approach. PLoS ONE 9(8): el 061 32. doi:10. 
1 371 /journal.pone.01 061 32 

Editor: Mark D. McDonnell, University of South Australia, Australia 

Received September 4, 2013; Accepted August 3, 2014; Published August 29, 2014 

Copyright: © 2014 Wang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: This work was supported by the National Science and Technology Major Project of China under Grant 2014ZX10004-001-014, the 973 Project under 
Grant 2014CB845302, the National Natural Science Foundation of China under Grants 61304151 and 61025017, the Science Foundation of Henan University under 
Grants 2012YBZR007 and 2013YBRW005. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the 
manuscript. 

Competing Interests: The authors have declared that no competing interests exist. 
♦ Email: wp0307@126.com 



Introduction 

Complex network theory and its applications have been popular 
topics in recent years [1-8]. Many real-word systems can be 
described by complex networks and investigated through complex 
network theory, such as social systems, biological systems. Gene 
regulatory networks, signal transduction networks, neural net- 
works, protein-protein interaction networks, metabolic networks 
are typical biological networks, which have been extensively 
investigated in related areas [9-15]. 

Complex networks consist of nodes and edges. An edge denotes 
the interaction between two nodes, which can be directed or 
undirected. Many biological networks are directed ones. For 
example, in gene regulatory networks, nodes represent genes or 
transcription factors, edges represent the interactions between 
transcription factors and the regulated genes, or between 
transcription factors. Over the last decade, identification of 
important nodes in complex networks has been an intriguing 
topic [16-32]. For example, in social networks, provided that one 
knows which nodes are the most important ones, one can control 
these nodes in priority to prevent the spread of infectious diseases 
[16]. However, it is still a challenge to determine which nodes are 
important in a complex network. Traditionally, degree is 
frequently used to characterize the importance of a node [1— 
3,6-8,16]. The other indexes include the betweenness [19], 
closeness [1], k-shell [7], principal component centrality [17] 



based on adjacency matrix of the network, semi-local centrality 
[20], motif centrality [25-30], PageRank [21] and others therein. 

For undirected networks, some researchers believe that the most 
connected nodes are the most influential [1-3]. But recently, 
Kitsak et al. [3 1] investigated the spreading dynamics on four real- 
world complex networks. They found that for networks with a 
single initial spreader, k-shell can predict the outcome of spreading 
dynamics more reliably than degree and betweenness. Following, 
Chen et al. [20] proposed a semi-local centrality, which considers 
the degrees of both the nearest and next nearest neighbors of a 
node. The semi-local centrality can more effectively characterize 
influential spreaders in complex networks than the degree and 
betweenness. Recently, following the method in [31], we identified 
influential spreaders in artificial random, small-world and scale- 
free networks. Some general conclusions have been obtained [32]. 

However, though there are numerous results reported on 
undirected networks, few results have been reported on directed 
biological networks [25-30]. In 2004, Sporns et al. [27] proposed 
a concept of motif fingerprint in brain networks, which counts the 
appearances of each node in network motifs with a given size as a 
measure. In 2007, based on the motif fingerprints and some of the 
other centrality measures, Sporns et al. [28] investigated the 
identification and classification of hubs in some brain networks. 
Also in 2007, based on the concept of network motif, Koschiitzki 
et al. [25,26] proposed some new motif-based measures for gene 
regulatory networks. They took the occurrences of each node in 
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the 3-node feed-forward loop (FFL) as a measure, after further 
considering the direction of each edge, another two extended 
measures were proposed. Interesting results on finding the global 
regulators in the gene regulatory network of E. coli have been 
reported. 

In this paper, based on the occurrences of each node in all 2- 
node, 3-node and some 4-node network motifs and the PCA, we 
aim at developing a new method to characterize node importance 
in directed biological networks. To evaluate the performance of 
the new index, the in-degree, out-degree, total degree, PageRank, 
motif centrality and betweenness are considered to compare with 
the proposed one. Investigations on five real-world biological 
networks illustrate the performance of the proposed measure. 

Materials and Methods 

Network Motifs and Motif Detection 

In 2002, Alon et al. proposed the concept of network motif, 
which is defined as subgraph that appears in a network 
significantly more than in randomized ones [9,11-13]. Network 
motifs are building blocks of complex biological networks [11]. 
Functions of some motifs have been extensively investigated. For 
example, for the FFLs, researchers have theoretically and 
experimentally found its functional and structural advantages 
[33-37]. Two-node motifs include the double negative feedback 
loop, double positive feedback loop, and that with auto-activation 
or repression loops [38,39]. Three-node motifs include the FFLs, 
the repressilator and so on [39,40], with some of them as shown in 
Fig. 1. Fig. 1 (a) shows the Drosophila developmental transcrip- 
tional networks with 119 nodes and 306 directed edges [13]. 
Fig. 1(b) shows some representative 2, 3 and 4-node motifs. The 
motifs in Fig. 1(b) is coded as Mj, where the subscript i denotes 
the size of the motif, j is the motif ID number, which is a decimal 
number that transformed from the adjacency matrix of the motif 
(For details, one can refer to Mfinder tool guide [41]). 



To detect network motifs, Milo et al. [1 1] scanned all possible i- 
node subgraphs in a network and its randomized counterparts, 
and defined network motifs as subgraphs for which the probability 
of occurrences in the real network are greater than that in 
randomized ones. Since 2002, many motif detection algorithms 
and softwares [41-46] have been developed. For example, gSpan 
[43], Mfinder [41], FANMOD [45], Mavisto [46] and mDraw 
(http://www.weizmann.ac.il/mcb/UriAlon). In the following, we 
use mDraw to detect network motifs. For each network, we 
generate 1 00 randomized networks. Number of a subgraph in the 
real-world network is denoted as N Tea i. The average number in 
1 00 random networks is denoted as iV ran[ j, with standard deviation 
denoted by Sj. The Z scort . measures the significance of the 
subgraph [11], which is defined as Z score = (W re . a i— N Ilm d)/Sd. 
Another index U is defined as the number of times a subgraph 
appears in the investigated network with distinct sets of nodes. In 
this paper, subgraphs with Z scorl ,S:2, C/S4 and iV real — A^ rand ^ 
0.1iV ran d are identified as motifs. 

A new measure of node importance based on network 
motifs 

Based on network motifs, we develop a new measure to 
characterize node importance in directed biological networks. 
Biological networks consist of some motifs, which act as functional 
units of the complex networks. For example, it has been found that 
the FFLs play functional roles in gene regulatory networks, such as 
an incoherent FFL can act as a fold-change detector [9,35]. Some 
other 3-node motifs and the 4-node bi-fan motif M 4 204 are also 
found to play functional roles in biological systems [9,14]. 
Therefore, nodes that frequently involved in network motifs may 
be more important. If a node involves in several different types of 
network motifs, then this node may potentially have multi- 
functional roles. Keeping the idea in mind, some related measures 
have been proposed to investigate the biological networks [25-30], 
We noted that in some works, network motifs are treated as 



M-,6 




(a) 



(b) 



Figure 1 . A real-world biological network and some network motifs, (a) A Drosophila developmental transcriptional network with 1 1 9 nodes 
and 306 directed edges, (b) Some representative 2, 3 and 4-node motifs. 
doi:1 0.1 371 /journal.pone.01 061 32.g001 



PLOS ONE | www.plosone.org 



2 



August 2014 | Volume 9 | Issue 8 | e106132 



Important Nodes In Directed Biological Networks 



subgraphs, such as the works of Rubinov et al. [30] and Wuchty 
et al. [42]. 

Hereinafter, different from the works in [25-30], based on all 2, 
3 and some 4-node motifs in directed networks, we propose a new 
integrative measure. Specifically, suppose we have a directed 
network with n nodes, and there are totally m types of 2, 3 and 4- 
node motifs. We denote the occurrences of node i in thej-th type 
of motif as Uy, i= 1,..., n,j= 1,... m. Then, we derive a matrix 
A = (UiJ) n xm for the network. In real-world networks, the impor- 
tance of different types of motifs are varied. Therefore, we endow 
each motif with a weight Wj,j = 1 , 2, . . . ,m, where Wj = Cj/ Ylk= l c fc> 
here, c h (k = 1, 2, ...,«?) denotes the number of the k-th type of motif. 
Subsequently, we derive a revised matrix 
B = (bij)„ x m = (bi ,b 2 ,:;b m ) = (WjUij)„ xm . 

Based on B and the idea of the PGA [47-49] , we construct the 
following index to obtain node importance score: 

m 

/ score = 5>A> (i) 

7=1 

where a = (ai,a2,...,a„,) T are parameters to be determined. The 
best index vector F core should have high distinguish ability among 
different nodes. Therefore, the variance of r"" re should be as large 
as possible. Taking B\,...,B m as random variables, which represent 
the weighted counts of a node in the m types of motifs. For a 
certain network with size n, the nxm matrix B = (b u b 2 ,..., b m ) is 
an observation matrix of the m dimensional random vector 
B = (Bi,B2,---,B m ) T . The covariance matrix of B can be estimated 
by its observation matrix B. Denote the covariance matrix of B as 
2, then 



Var(S)« Var(73) = S= -(B T B-nBB T ), 

n— 1 

where B is the column mean vector of B, n is network size. It is 
noted that 2 is just the unbiased estimator of Var(£>) [48]. Based 
on the above notations, we have a stochastic form of r" yre as 

jscore =a ,T B Thg variance of jscore caQ bg estimated by 



Var(T score ) = Var (tx T B) = x T Var (B)ot « a T 2a. 

To determine the unique optimal vector oc, we restrict tyJa = 1 . 
Thus, a can be determined through the following constrained 
extremal problem: 



max a T 2a 



s.t. a T a=l. (2) 

To solve the optimization problem (2), by the Lagrangian 

multiplier method, we construct the following Lagrangian 
function. 



L(a,X) = x T ?,x-X(x T a-\). (3) 

And let 



^=2(2-lE> = 0, 

t T (4) 

where E is the identity matrix. It follows from Eq.(4) that X and a 
are just the eigenvalue and eigenvector of matrix 2. Under Eq.(4), 
Var(J score )«a T Za = lo( T a = /t. Therefore, the optimal X and a 
are just the biggest eigenvalue and the corresponding unit 
eigenvector of 2. Denote the eigenvalues of 2 as 
X\ >X 2 > ... >X m >0, then the optimal X = X\. From the theory 
of the PC A, the ratio X\/Y^i= \ h can reflect the contribution of 
X SC0K , or how much information in B can be extracted by X icore . 

So far we have determined a. For a concrete network, replacing 
bj in Eq.(l) with concrete values, one determines the observation of 
Z* LOK as r"'". Finally, the nodes in the network can be ranked 
according to 7 s '""'. Nodes with larger 7 s "'"' values are more 
structurally important. Based on P'""' and some well-defined 
distances, such as the well-known Euclidean distance, the n nodes 
can be classified into several clusters, where nodes in the same 
cluster are similarly important. 

To sum up, for a network with n nodes, the procedures of the 
proposed measure are as follows. 

1) Detect 2, 3 and 4-node network motifs in the network. 

2) Count the occurrences of each node in m types of motifs, and 
derive a nxm matrix A. 

3) Perform data processing on A, such as weighting and 
standardizing matrix A, then we obtain a matrix B. Compute the 
covariance matrix 2 of B. 

4) For 2, compute the biggest eigenvalue X and the 
corresponding unit eigenvector a. 

5) Compute / store according to (1) and rank the n nodes 
accordingly. 

An illustrative example 

To illustrate the procedures of the proposed method, we give a 
simple example. The simple artificial network contains 6 nodes, 
and the topology of the network is shown in Fig. 2 (a). Suppose 
there are three motifs in the network, namely, M 3 38, M3 108, M 2 6, 
as shown in Fig. 2(b). Fig. 2(c) lists the members of the three 
motifs. Occurrences of nodes in each motif are summarized in 
Fig. 2(d). As we see, the occurrences of M 3 38, M3IO8, M 2 6 are 8, 
2 and 2, respectively. Therefore, the weights of M :i 38, M3IO8, 
M 2 6 are a>\=2/3>, (02 = 1/6, (03 = 1/6. Subsequendy, we derive 
matrix B and its covariance matrix 2. 



B={b\,bi,h) = 
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1.7778 0.0667 0.0000 
0.0667 0.0111 0.0111 
0.0000 0.0111 0.0185 



The eigenvalues of 2 are X\ = 1.7803, k 2 =0.0257, +, =0.0014, 
and the unit eigenvector corresponding to k\ is: 



a = (0.9993, 0.0377, 0.0002) T 



Thus, we have 



= 0.9993foi +0.03776 2 + 0.0002fc 3 



(5) 



The contribution of is k x Km + k 2 + kj) = 98.50%. That is, 
98.50% information that contained in bi, b 2 , b 3 can be extracted 
by f"' rl \ Therefore, f ™ c can optimally rank the 6 nodes. 
Substitute b\, b 2 , b 3 in matrix B into Eq.(5), we have 

/SCO re = (4.0034,0.6725,4.0098, 1.9986,3.3372,2.0049) T . 



From p 00 ^ the third value is the biggest. Therefore, we can 
judge that node v 3 is the most important one, and then V\, the least 
important node is v 2 . If one simply considers the total occurrences 
of a node in all the motifs, then v 2 and v$ would be treated as 
equally important. Whereas, from the proposed method, W4 is 
more important than v 2 , which is reasonable in that the 
occurrences of M 3 38 are significantly more frequent than the 
other motifs. Based on F"" rc and through cluster analysis, the six 
nodes can be classified into three clusters, where Vi, 1)3, v 5 are 
members of the most important cluster; « 4 , w 6 are members of the 
less important cluster; while v 2 is the single member of the 
unimportant cluster. 



Data descriptions 

The five real-world biological networks include the C. Elegans 
Neural (CEN) network [50,51], the E. Coli Transcriptional (ECT) 
regulatory network from the RegulonDB database [52], the Yeast 
Transcriptional (YT) regulatory network [53], the Drosophila 
Developmental Transcriptional (DDT) network [13], and the 
Human Signal Transduction (HST) network [13]. We note that 
the investigated networks are with high quality and have been 
frequently used as models to detect network motifs [9,1 1-13]. 

Simple statistical indexes for the five networks are summarized 
in Tab. 1 . Numbers of nodes for these networks range from 1 19 to 
1706. Numbers of edges range from 306 to 3870. The five 
networks are with abundant network motifs, such as the FFL 
M 3 38, M 3 46, the bi-fan M 4 204. It is noted that, we have 
considered all 2, 3-node motifs, but for simplicity, we have only 
considered three 4-node motifs: M 4 204, M 4 328 and M 4 904. There 
are totally 199 connected 4-node subgraphs, and there are many 
4-node motifs in the five networks. For example, in the CEN and 
ECT, there are seven 4-node motifs. Since the bi-fan M 4 204 and 
the bi-parallel M 4 904 have been frequently investigated under 
various context [9], they are common motifs in many different 
real-world networks [1 1], and the 4-node chain M 4 328 may play 
crucial roles in signal transduction pathways, we will only consider 
these three 4-node motifs. From Tab. 1, the CEN has the most 
abundant of motifs. Subgraph M 2 6 is only a motif in the CEN and 
ECT, and the actually numbers are 233 and 10, respectively. The 
M 4 328 is only a motif in the HST, the actual number is 1570. 
There are no 3-node motifs in the HST. Whereas, for most of the 
networks, the FFL and bi-fan are motifs. The YT only consists of 
the FFL and bi-fan. 

Results 

Identifying important nodes in the five networks 

Following the procedures as the illustrative example, one can 
obtain the order factor for each network. Noted that the 
occurrences of different motifs have different order of magnitude, 
we have performed standardized transformations to matrix B. 
Moreover, we denote the columns of matrix B as the vector b,j, 
where i and j have the same meaning as that in Mj. 
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Figure 2. An illustrative example, (a) A simple network with six nodes, (b) Subgraphs that are assumed to be motifs in network (a), (c) Members 
that compose the three types of motifs, (d) Appearances of nodes in each motif as shown in panel (b). (e) Frequency histograms for the six nodes, (f) 
Cluster analysis reveals that the six nodes can be remarkably classified into three classes. v q , v 3 , v 5 are the most important nodes, and v 2 forms the 
least important group, \/ 4 , v 6 form another group, which is more important than v 2 . 
doi:1 0.1 371 /journal.pone.01 061 32.g002 
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Table 1. Statistical indexes for the five directed biological networks. 
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YT 


Node 


280 


1706 


119 


227 


685 


Edge 


2194 
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29535 




280 
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"-" denotes no such item. 
doi:10.1371/journal.pone.0106132.t001 



The / s<ore for the five networks are obtained as follows. 

= 0.26546 2 6+0.38662> 3 38 + 0.3924M6 

+ 0.3901A 3 108 + 0.36586,1 10 + 0.3285fo 3 238 (6) 

+ 0.32806 4 204 + 0.3529^904. 



I£™ =0.37786 2 6 + 0.0.52506 3 38 + 0.5352M6 
+ 0.54346 4 204. 



(7) 



=0.5585/3 3 38 + 0.5721/3 3 46 + 0.6006/3 3 108. (8) 



I*™* = 0.58416 4 204 + 0.56966 4 328 + 0.57826 4 904. (9) 



Fig. 3 shows the dendrogram for the top- 30 nodes of the five 
networks. We can see that these nodes can be roughly classified 
into three or four groups, detailed information of the top-30 nodes 
in the CEN, ECT, YT and their corresponding rankings by the 
other methods are summarized in Tabs. 2-4. The corresponding 
information for the DDT and HST are shown in Table SI and S2. 
In each table, we have shown the in and out-degree as well as their 
rankings by the other methods. Here, -ft to tai is based on the total 
degree, R p is based on the PageRank, R mc is based on the motif 
centrality, and i? bet is based on the betweenness. The motif 
centrality only considers the FFL, since there are no such motif in 
the HST, it fails to work in the HST. For each network, the last 
group contains the largest amount of nodes, while the most 
important group G\ contains only one to three nodes. From Fig. 3, 
for the five biological networks, only a few nodes are far more 
important than the others. There are clear hierarchical structures 
in these networks, which indicates that the proposed measure may 
also act as an effective hierarchical index. 



/^° re =0.7071/j 3 38 + 0.7071/3 4 204. (10) 

Replacing b j with concrete values in matrix B for each network, 
one obtains the importance score for each node. Average r" yle - 
values for the five networks are shown in Tab. 1. Based on 7 score ) 
we can characterize the node importance and classify the nodes for 
each network via cluster analysis. The basic idea of cluster analysis 
is as follows [48]. According to p COTe t the Euclidean distance 
between any two nodes can be obtained. Firstly, two nodes with 
the shortest distance are merged as one group, each of the rest 
nodes forms a group. Then, one merges node groups via the single 
linkage method, until all nodes are finally merged into one cluster. 
This cluster processes can be mimicked by a dendrogram. From 
cluster analysis, one can classify nodes into groups, with similar 
important nodes in the same group. Furthermore, from the 
dendrogram, one can intuitively get some knowledge about the 
structural features of the network. 



Functional characteristics of the top-ranked nodes 

In the following, for the CEN, ECT and YT, we discuss whether 
the identified structurally top-ranked nodes are functionally 
important. 

For the CEN, the identified top-30 nodes are shown in Tab. 2. 
The top-7 nodes are AVER, AVBR, AVEL, AVDR, AVDL, 
AVBL and AVAL, which are all command interneurons. 
Additionally, the AVAR, PVCL and PVCR are another three 
command interneurons, which are all top-ranked. The AVAs, 
AVBs, AVDs, and PVCs are four bilaterally symmetric interneu- 
ron pairs with large diameter axons that run the entire length of 
the ventral nerve cord, and providing inputs to the ventral cord 
motor neurons. The AVAs locate at the lateral ganglia of head of 
the C. elegans, functioning as driver cell for backward locomotion 
[54]. The AVEs can drive backward movement of the animal 
along with AVAs, AVDs and A-type motor neurons [54]. The 
AVDs function as touch modulator for backward locomotion 
induced by head-touch. The PVCs are ventral cord interneurons, 
a harsh touch defect can be caused in the absence of PVC neurons 
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Figure 3. Cluster analysis for the identified top-30 nodes in the five networks based on the f core . 

doi:1 0.1 371 /journal.pone.01 061 32.g003 



[54]. From Tab. 2, the AVER has the largest T 500 " value 54.90, 
the in and out-degree of AVER are 33 and 18, which are not the 
largest. However, from our investigation, the AVER is the most 
important nodes in the CEN, which demonstrates that the P"'"' is 
different from the degree measures. The PageRank fails to identify 
most of the command interneurons as even among the top-50 
level. The betweenness ranks many of the command interneurons 
out of the top-20 level. The results for the CEN indicate F' ""' can 
help to identify the actual important nodes. 

For the ECT, the identified top-30 nodes are shown in Tab. 3. 
In 2003, Martinez-Antonio et al. [55] identified global regulators 
in an ECT network. There are 1 8 global regulators in the network, 
namely, CRP, IHF, FNR, fis, arcA, lrp, hns, narL, ompR, fur, 
phoB, cpxR, soxR, soxS, mlc, cspA, rob, purR. Among which, the 
CRP, FNR, IHF, fis, arcA, narL, lrp are seven key regulators, 
which can regulate the expression of 51% of genes in E. coli [55]. 
From i score ! eight of the top- 12 nodes (CRP, FNR, arcA, IHF, fis, 
narL, lrp, fur) are global regulators. The in-degree ranks most of 
the eight global regulators at the tail. The out-degree and total 
degree rank most of the eight global regulators at the top- 10 level. 
According to the PageRank, motif centrality and betweenness, 2, 1 
and 3 of the identified top-ranked global regulators are out of the 



top- 10 level. The global regulator CRP is the most important 
nodes, which represents the cAMP receptor protein. The CRP can 
regulate cAMP, and genes regulated by the CRP are mosdy 
involved in energy metabolism [56] . The CRP has the largest out- 
degree 496. But its in-degree is only 1. Though 280: csgE has the 
second largest in-degree 12, it is not top-30 ranked. From Tab. 3, 
the top-30 nodes can be classified into four clusters. The 
unimportant cluster contains the largest amount of nodes. The 
first three clusters are almost all global regulators. The observa- 
tions from the ECT indicate that the proposed measure can help 
to find global regulators. 

For the YT, the top-30 nodes are shown in Tab. 4. STE12 and 
TEC1 are two most important nodes, with the f"" re values 489.54 
and 482.02, with the out-degree 71 and 44, and with the in-degree 
both 0. STE12 and TEC1 are two transcription factors. It has 
been reported that the STE 1 2 controls two distinct developmental 
programs of mating and filamentation, therefore, it is a key 
regulator of cell fate determination [57]. Although the TEC1 gene 
has been reckoned as involving in the activation of expression of 
Tyl and the adjacent genes, it is not essential in the control of 
mating or sporulation processes [58] . It is intriguing to clarify why 
TEC1 is so frequently involved in network motifs and acts as 
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building blocks of the YT network. From the results of the out- 
degree, total degree, PageRank, motif centrality and betweenness, 
most of the nodes in G 4 are equally important, and thus have great 
differences from p core . 

Performance evaluation based on ROC curves 

To evaluate the performance of / score ) W e perform ROC 
analysis. ROC curve is frequentiy used to evaluate the perfor- 
mance of a new test in the field of signal processing and medical 
diagnostic tests [59]. For a network with n nodes, the procedures 
of ROC analysis are as follows. Suppose the nodes can be classified 
into two groups: important and unimportant, and we know the 
actual classification. For a new index, the n nodes are with values 
in the interval [a, b], for any threshold value Te[a,b], one can 
reclassify the n nodes into two classes. Comparing the actual 
classification with the new classification, several indexes can 
measure the accuracy of the new index, which are defined as 
follows [59]. 



Pi = 



n 2 



n 2 + n 4 



Ml +«3 



Ml +«4 



(11) 



(12) 



(13) 



where n 2 denotes the number of false positive nodes, which are 
considered important in the new classification but actually 
unimportant, gives the number of true negative nodes, where 
the nodes are both unimportant in the two classifications. 
Similarly, ri\ and n 3 denote the number of true positive and false 
negative nodes, respectively. Pi, P 2 are therefore called false and 
true positive rates, respectively. P3 is called the accuracy of the 
new index. Given a T, one obtains a point (Pi, P 2 ). For Te[a,b], 
plotting the corresponding points in two dimensional coordinate 
system, we derive the ROC curve. The area under the curve 
(AUG) of ROC equals the probability that a classifier will rank a 
randomly chosen positive instance higher than a randomly chosen 
negative one [59], which can reflect the identification accuracy of 
the new index. The larger AUC, the more accurate of the index. 
Furthermore, the point in the upper left corner of a curve 
corresponds to the optimal threshold T, which gives the new 
classification of nodes with the highest P 3 . 

Hereinafter, based on the available information of some of the 
investigated networks and ROC curves, we evaluate the perfor- 
mance of P <ore and the other indexes. In the following, for 
simplicity, we transform node ranks into fractional ones (range in 
(0, 1]). For nodes with rank k, its fractional ranks are the ratio of 
the number of nodes with ranks no more than k to n. Obviously, 
nodes with smaller fractional ranks are more important. For the 
CEN, on one hand, we have mentioned that the 10 command 
interneurons are known to be very important. If we take them as 
important nodes, one derives the ROC curves for each index, as 
shown in Fig. 4(a). From Fig. 4(a), the in-degree, total-degree, 
P <ore and motif centrality all can well identify the command 
interneurons, the AUC (trapezoidal method) for these indexes are 
0.9991, 0.9985, 0.9974, 0.9967, which are all above 0.99. The 
P <ore is a little better than the motif centrality. The out-degree, 
PageRank and betweenness are all worse than the other indexes. 
On the other hand, neurons in the C. Elegans can be classified into 



interneurons, motor neurons, sensory neurons, where 1 1 7 neurons 
function as interneurons. If we take the 117 interneurons as 
important nodes, one obtains another ROC curve for each index, 
as shown in Fig. 4(b), where all the seven measures have roughly 
similar performance. The P <ore is a little better than the out- 
degree, in-degree, PageRank and betweenness. For the ECT, 
there are 7 key and totally 1 8 global regulators, which are actually 
important in the network. If we take the 7 key global regulators 
and 18 global regulators as actually important nodes, we derive 
two ROC curves for each index, as shown in Fgs.4(c) and (d). In 
Fig. 4(c), the AUC for the seven indexes are 0.9996, 0.4385, 
0.9996, 0.9997, 0.9983, 0.9987 and 0.9239. Except the in-degree 
and betweenness, all the indexes can well identify the key global 
regulators. F 00 ™ is a little better than the other indexes. From 
Fig. 4(d), the out-degree, total-degree, PageRank and motif 
centrality are with quite large AUC. The AUC for the P core is 
0.8628, which is only higher than that for the in-degree and 
betweenness, however, when T = 0.0036, the f core can classify the 
nodes in the ECT with P :i = 99.30%. 

For many biological networks, the actual classifications, known 
as gold standards, are not available. Fortunately, researchers have 
proposed several methods to evaluate the new test, such as 
constructing composite reference standards from available multi- 
ple tests [60,61]. A single ranking from either the in, out, total 
degrees, PageRank, motif centrality or betweenness is imperfect 
and can not act as a gold standard. Subsequently, for each 
network, we construct a composite reference standard based on 
the six rankings (Five in the HST), and evaluate the accuracy of 
P <ore . Specifically, in the composite reference standard, a node is 
defined as important if either one of the six rankings is among the 
top-To, where T Q is a threshold, which can be taken as 10%, 20% 
and so on. Thus, given a T () , we derive a dichotomous reference 
classification of nodes in the network, either positive (important) or 
negative (unimportant). According to the ranking from the f "" : , 
we take several threshold values T to reclassify nodes, and finally 
derive the ROC curves for each network, as shown in Fig. 5. 
Figs. 5(a) and (b) show the cases with T 0 = 10% and T o = 20%, 
respectively. In Fig. 5(a), the AUC for the five networks are 
0.8977, 0.8237, 0.9406, 0.8499 and 0.7878, respectively. The 
points in the upper left corner of the ROC curves in Fig. 5(a) 
correspond to P=20%, 5%, 10%, 10%, 5%, which lead to the 
highest P3. For example, for the DDT, when the top- 10% nodes 
are classified as important ones, the classification from the P core 
has the best consistency with the reference classification, the P 3 
can achieve 94.96%. For T 0 = 20%, the AUC for the five networks 
are 0.8740, 0.8884, 0.9521, 0.8955 and 0.7418, respectively. 
Under two different To and for different networks, the AUC are all 
above 0.74. Especially, in the DDT, the AUC is above 0.94, which 
indicates high identification accuracy of the proposed measure. 

It is noted that, for the DDT, HST and YT, since we still do not 
know how many nodes are actually important, it is difficult to 
compare among different measures via ROC curves. We also note 
that the ROC analysis without gold standards may subject to bias 
of the composite reference standard. However, since the composite 
reference standards for the five networks are based on six or five 
existing measures, it is trustworthy to treat them as reference 
standards. In conclusion, ROC analysis indicates the proposed 
measure is a remarkable alternative index to identify structurally 
important nodes in directed networks. 

Topological neighborhoods of several special nodes 

From the ROC analysis in the CEN and ECT, some measures 
are better than the P core in identifying the command interneurons 
or global regulators. Hereinafter, through the specific analysis on 
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Figure 4. ROC curves based on the available information in the CEN and ECT. (a) Performance of different indexes in identifying (a) the 10 
command interneurons in the CEN, (b) the 1 1 7 interneurons in the CEN, (c) the 7 key global regulators in the ECT, (d) the 18 global regulators in the 
ECT. 
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topological neighborhoods of several nodes, we further illustrate 
the merits of the proposed measure. According to p core s some hubs 
may be not important, whereas some non-hub nodes may be 
identified as very important ones. There are many highly 
connected but not highly ranked nodes, such as 946: soxs in the 



ECT; 22: b-catenin and 68: fak in the HST; 209: GCN4 and 332: 
MBP1-SWI6 in the YT. Examples of nodes with low degrees but 
ranked at top-20 include 333: cysG and 534-536:nirB-nirD in the 
ECT; 546: SSA4 and 587TKL2 in the YT. 
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Figure 5. Evaluation of / Jcore via ROC curves with composite reference standards for the five networks, (a) T 0 = 1 0%. A node is defined as 
important if either its rankings by the in, out, total degree, PageRank, motif centrality or the betweenness are at the top-To level, (b) Similarly to (a), 
but with To = 20%. 
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In the following, we take node 209 and 546 in the YT as two 
representative examples. Node 209 has out-degree 53 and in- 
degree 0, which is the second most important node according to 
the out and total degree, while its ranking is 62 according to P'" n \ 
Node 546 is with the in and total degree 4, the ranking is 28 
according to the total degree, but it is ranked as the eighth most 
important node by the P'"" , Tigs. 6 (a) and (b) visualize the 
topological neighborhoods of the two nodes with their nearest and 
second nearest neighbors. From the topological neighborhoods of 
the two nodes, there are 8 1 nodes involved in the neighborhood of 
node 209, which are connected by 111 directed edges that 
centered at node 209, while 114 nodes and totally 182 directed 
edges consist of the neighborhood of node 546. The connection 
density of the neighborhood of node 209 is much lower than node 
546. Moreover, from Fig. 6 (b), one can easily see that node 546 is 
directly regulated by four hub nodes and acts a bridge or 
bottleneck of the topological neighborhood. More importantly, the 
four hub neighbors of node 546 are just the top-4 nodes. Though 
node 209 can regulate 53 nodes, but its neighbors are neither hubs 
nor important nodes. Furthermore, node 546 involves in 1 203 bi- 
fan subgraphs in its topological neighborhood, while there are only 
39 such subgraphs for node 209, which indicate node 546 may 
play more functional roles in the system. Therefore, node 546 may 
be more important than 209. Finally, from the roles of biological 
functions, node 209 represents GCN4. It has been found that the 
GCN4 gene is conserved in S. cerevisiae, K. lactis, and E. gossypii 
[62]. SSA4 is widely conserved in human, chimpanzee, Rhesus 
monkey, dog, cow, mouse, rat, chicken, zebrafish, fruit fly, C. 
elegans, S. cerevisiae, and A. thaliana [62]. The cross species 
conservation of a gene indicates that it has been maintained by 
evolution despite speciation. It has been widely believed that 
mutation in a highly conserved gene can lead to a non-viable life 
form, or a form that is eliminated through natural selection 
[62,63]. SSA4 is more widely cross species conserved, which also 
indicates that SSA4 is more important than GCN4. Summing up, 
it is sufficient that the non-hub node 546 is actually more 
important than the hub node 209. 

From the above analysis, it seems that node 546 similarly 
functions as nodes with high betweenness in undirected networks. 
However, we note that there are great differences between 
directed and undirected networks. In the YT, the node 209 has 
only 53 outgoing edges and the node 546 has only 4 ingoing edges, 
the betweenness [19] of the two nodes are both zeros, which are 
the least important nodes according to it. Therefore, the P'"" is 
different from the classical betweenness. Furthermore, since the 
YT is a directed network, the betweenness of 96.06%(658/685) 
nodes are zeros, it fails to act as an effective ranking measure. It is 
noted that node 293: IME1 has the largest betweenness in the YT, 
whereas, it is not highly ranked according to P' OI '\ Fig. 6 (c) shows 
the topological neighborhood of node 293. Node 293 is with 5 
ingoing and 1 3 outgoing edges, but it is not frequently involved in 
network motifs. In conclusion, from the topological neighborhoods 
of several concrete nodes, we can further conclude that the 
proposed measure has its merits. 

Discussion 

Biological networks are typical real-world complex networks. It 
has been reported that a single measure is insufficient to 
distinguish lethal nodes clearly from viable ones in some biological 
networks [26,64]. Therefore, it is intriguing to find some more 
effective measures to characterize node differences in biological 
networks. In this paper, based on the integration of the 
occurrences of each node in 2, 3 and some 4-node network 



motifs, we have proposed a new measure to characterize node 
importance in biological networks. Based on ROC curves and the 
analysis of the topological neighborhoods of several specific nodes, 
we have compared the obtained results with that from the degree, 
PageRank, motif centrality and betweenness. 

In the CEN and YT, when the command interneurons, 
interneurons, key global regulators and global regulators are 
treated as actually important nodes, we compared the perfor- 
mance among different measures. The proposed measure has 
good performance in the two networks. The in-degree is good at 
identifying command interneurons in the CEN, but it is bad at 
finding global regulators in the ECT. The out-degree displays the 
contrary tendency as the in-degree. Though the PageRank can 
effectively identify the global regulators in the ECT, it is the worst 
measure in identifying command interneurons or interneurons in 
the CEN. Similarly, the betweenness is also not a good measure in 
the two networks. Therefore, the in-degree, out-degree, PageRank 
and betweenness are not robust indicators of important nodes in 
different networks. The p corc provides an alternative robust 
measure for different types of biological networks. 

Since the current knowledge on the five networks are limited, 
we note that it is still an open problem to further mining the 
advantages of the new measure. The number of command 
interneurons in the CEN and global regulators in the ECT are 
much fewer than the network sizes, the ROC analysis may suffer 
the effect of noise both in the interaction data and computation 
processes. We note that some other approaches may be used to 
further investigate the merits of the new measure, such as rich-cub 
analysis [8,65-69]. For simplicity, we simply examine the 
connectivity densities among the same amount of top-ranked 
nodes according to different measures in the ECT and HST, as 
shown in Fig. 7. Here, p(8) is defined as the ratio of the total 
actual number of edges to the maximum possible number of edges 
among the top- 1000% nodes. In Fig. 7, different curves corre- 
spond to different indexes. From Fig. 7, we can see that for many 
indexes, top-ranked nodes tend to be with higher connectivity 
densities than nodes ranked at the tail. The motif centrality fails to 
work in the HST, since the FFL is not a motif in such network. 
Moreover, comparing among different indexes, the P"'"- is very 
good at finding the cluster with high connectivity densities. That is, 
the connectivity density among a few motif-rich nodes are higher 
than the same number of top-ranked nodes by the other indexes. 
For example, in the HST, the connectivity density among the top- 
10% (0 = 0.1) motif-rich nodes is above 0.10, while the top- 10% 
large-degree nodes are with p(8) below 0.08. 

It has been reported that many bio-molecular networks are 
disassortative, which have negative Pearson correlation coefficients 
(PCCs) [6]. For example, the PCCs of the CEN and YT are - 
0.0537 and —0.3496. The disassortativity indicates that large- 
degree nodes would connect with low-degree ones rather than 
with each others. Whereas, nodes with high P 1 "" involve in many 
network motifs. Motif-rich nodes tend to form small connected 
subgraphs. Thus, the F"'" 1 may be helpful to find clusters with 
high connectivity density in disassortative networks. 

Finally, we note that this paper only considers five real-world 
biological networks, it is intriguing to further investigate the 
performance of the P'" re in some artificial networks, such as 
artificial scale-free, small-world networks and networks with 
community structures. It is noted that for networks with large 
cliques at the periphery, nodes in the cliques may have very high 
P" ,le - values, and therefore, these nodes may be highly ranked. 
Therefore, for such networks, the identified highly ranked clusters 
are probably just the large cliques. We will further investigate the 
related questions in our future works. 
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Figure 6. Topological neighborhoods of several nodes, (a) Topological neighborhood of a hub but not top-ranked node: node 209 in the YT. 

(b) Topological neighborhood of a non-hub but top-ranked node: node 546 in the YT. (c) Topological neighborhood of a not top-ranked node but 
with the highest betweenness: node 293 in the YT. 
doi:1 0.1 371 /journal.pone.01 061 32.g006 



Conclusions 

In this paper, based on network motifs and multivariate 
statistical analysis, we have proposed a novel measure to 
characterize node importance in directed biological networks. 
The new measure enable us to further mining undiscovered 
characteristics of nodes in directed biological networks. Through 



the new measure, we have investigated five real-world biological 
networks, which include a neural network, three transcriptional 
regulatory networks and one signal transduction network. These 
networks vary in sizes and link densities, and consist of various 
types of network motifs. 

Based on the proposed measure, we have identified important 
nodes in the five networks. Our investigations suggest that the 
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Figure 7. The curves of connectivity density p(0) against 0 for different ranking measures in the ECT and HST. 
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most important nodes in biological networks only take up a small 
fractions, but many of them are with important biological 
functions in real-world biological systems. Moreover, ROC 
analysis reveals that the proposed measure is a rather stable 
indicator of important nodes, and with very high prediction 
accuracy. Furthermore, the proposed measure can well charac- 
terize non-hub but very evolutionary conserved functional 
important nodes, and simultaneously, exclude hubs but not so 
functionally important nodes from the top rankings. Finally, we 
have discussed that the proposed measure may be used to reveal 
clusters with high connectivity density in disassortative networks. 
From these statistical analysis, we conclude that the proposed 
measure has some unique merits and it can be acted as an 
alternative network metric. 

Although we have mainly investigated some directed biological 
networks, the proposed measure can be extended to some other 
networks, such as electrical networks, social networks. It is also 
noted that the proposed measure can be extended to involve more 
types of network motifs, but with the increasing of motifs, the 
computational complexity will be increased. Moreover, if the FFL 
is the unique network motif in a directed network, the proposed 
method will degenerate into the motif centrality [25] . Lastly, we 
note that this paper provides an alternative way to characterize 
node features, it is still an open problem to find more effective 
ranking measures for nodes in directed biological networks, since it 
is generally difficult to obtain the actual rankings and a single 
measure is often insufficient to perfecdy characterize all nodes. 
The related researches can help us to identify the actual key nodes 
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